Dynamic Position Encoding for Transformers
Recurrent models have been dominating the field of neural machine translation
(NMT) for the past few years. Transformers \citep{vaswani2017attention} have
radically changed it by proposing a novel architecture that relies on a
feed-forward backbone and self-attention mechanism. Although Transformers are
powerful, they could fail to properly encode sequential/positional information
due to their non-recurrent nature. To solve this problem, position embeddings
are defined exclusively for each time step to enrich word information. However,
such embeddings are fixed after training regardless of the task and the word
ordering system of the source or target language.
In this paper, we propose a novel architecture with new position embeddings that depend on the input text, addressing this shortcoming by taking the order of target words into consideration. Instead of using predefined position embeddings, our solution \textit{generates} new embeddings to refine each word's position information. Since we do not dictate the positions of source tokens but learn them in an end-to-end fashion, we refer to our method as \textit{dynamic} position encoding (DPE). We evaluated the impact of our model on multiple datasets for translation from English into German, French, and Italian, and observed meaningful improvements over the original Transformer.
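The abstract describes position embeddings that are generated from the input text rather than read from a fixed table. Below is a minimal PyTorch sketch of that general idea; the module name, the feed-forward generator, and the way the dynamic term is combined with a sinusoidal baseline are illustrative assumptions, not the paper's exact DPE architecture.

import math
import torch
import torch.nn as nn


class DynamicPositionEncoding(nn.Module):
    """Adds an input-dependent refinement on top of a fixed sinusoidal table."""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        # Standard sinusoidal table as the static starting point.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # Small generator that refines positions from token content (hypothetical design).
        self.generator = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model)
        static_pe = self.pe[: token_emb.size(1)].unsqueeze(0)
        dynamic_pe = self.generator(token_emb + static_pe)  # input-dependent term
        return token_emb + static_pe + dynamic_pe


if __name__ == "__main__":
    layer = DynamicPositionEncoding(d_model=64)
    x = torch.randn(2, 10, 64)   # toy batch of token embeddings
    print(layer(x).shape)        # torch.Size([2, 10, 64])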
SALSA-TEXT : self attentive latent space based adversarial text generation
Inspired by the success of the self-attention mechanism and the Transformer architecture in sequence transduction and image generation applications, we propose novel self-attention-based architectures to improve the performance of adversarial latent code-based schemes in text generation. Adversarial latent code-based text generation has recently gained a lot of attention due to its
promising results. In this paper, we take a step to fortify the architectures used in these setups, specifically AAE and ARAE, and benchmark our models against these two adversarial latent code-based methods. In our experiments, we use the Google sentence compression dataset to compare our approach with these baselines using various objective and subjective measures.
The experiments demonstrate that the proposed (self-)attention-based models outperform the state of the art in adversarial code-based text generation.
Comment: 10 pages, 3 figures, under review at ICLR 2019
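At a high level, the latent code-based setups the abstract refers to pair an encoder that maps sentences to latent codes with a discriminator that tries to separate those codes from samples of a prior (as in an AAE). The sketch below only illustrates that general skeleton with a self-attention (Transformer) encoder; all module names, sizes, and the Gaussian prior are assumptions, not the paper's proposed architectures.

import torch
import torch.nn as nn


class SelfAttentiveEncoder(nn.Module):
    """Transformer encoder that pools a sentence into a single latent code."""

    def __init__(self, vocab_size=10000, d_model=128, n_heads=4, n_layers=2, z_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.to_z = nn.Linear(d_model, z_dim)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        h = self.encoder(self.embed(tokens))    # contextualized states
        return self.to_z(h.mean(dim=1))         # pooled latent code


class LatentCritic(nn.Module):
    """Discriminator that separates encoded codes from prior samples (AAE-style)."""

    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, z):
        return self.net(z)


if __name__ == "__main__":
    enc, critic = SelfAttentiveEncoder(), LatentCritic()
    tokens = torch.randint(0, 10000, (4, 12))   # toy token ids
    z_fake = enc(tokens)                        # codes from sentences
    z_prior = torch.randn(4, 64)                # samples from the assumed prior
    print(critic(z_fake).shape, critic(z_prior).shape)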
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation
With the ever-growing size of pretrained models (PMs), fine-tuning them has
become more expensive and resource-hungry. As a remedy, low-rank adapters
(LoRA) keep the main pretrained weights of the model frozen and just introduce
some learnable truncated SVD modules (so-called LoRA blocks) to the model.
While LoRA blocks are parameter-efficient, they suffer from two major problems:
first, the size of these blocks is fixed and cannot be modified after training
(for example, if we need to change the rank of LoRA blocks, then we need to
re-train them from scratch); second, optimizing their rank requires an
exhaustive search and effort. In this work, we introduce a dynamic low-rank
adaptation (DyLoRA) technique to address these two problems together. Our
DyLoRA method trains LoRA blocks for a range of ranks instead of a single rank
by sorting the representation learned by the adapter module at different ranks
during training. We evaluate our solution on different natural language
understanding (GLUE benchmark) and language generation tasks (E2E, DART and
WebNLG) using different pretrained models such as RoBERTa and GPT with
different sizes. Our results show that we can train dynamic search-free models
with DyLoRA at least 4 to 7 times (depending on the task) faster than LoRA
without significantly compromising performance. Moreover, our models can
perform consistently well on a much larger range of ranks compared to LoRA.
Comment: Accepted to EACL 2023
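The core mechanism the abstract describes, training LoRA factors so that they work across a whole range of ranks by truncating them at a sampled rank each step, can be sketched as follows. This is a minimal, assumption-laden PyTorch illustration: the layer name, the uniform rank sampling, and the scaling are placeholders rather than the exact DyLoRA recipe.

import random
import torch
import torch.nn as nn


class DyLoRALinear(nn.Module):
    """Frozen linear layer plus LoRA factors that are truncated at a sampled rank."""

    def __init__(self, in_features, out_features, max_rank=8, alpha=16):
        super().__init__()
        # Stand-in for the frozen pretrained weight.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        self.lora_A = nn.Parameter(torch.zeros(max_rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, max_rank))
        nn.init.normal_(self.lora_A, std=0.02)
        self.max_rank = max_rank
        self.alpha = alpha

    def forward(self, x, rank=None):
        # Sample a truncation rank b during training; at inference any
        # rank up to max_rank can be chosen without retraining.
        b = rank if rank is not None else random.randint(1, self.max_rank)
        A, B = self.lora_A[:b], self.lora_B[:, :b]
        delta = (self.alpha / b) * (B @ A)       # low-rank update of rank b
        return x @ (self.weight + delta).t()


if __name__ == "__main__":
    layer = DyLoRALinear(32, 32, max_rank=8)
    x = torch.randn(4, 32)
    print(layer(x, rank=2).shape, layer(x).shape)   # usable at any rank <= 8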
ALP-KD: Attention-Based Layer Projection for Knowledge Distillation
Knowledge distillation is considered a training and compression strategy
in which two neural networks, namely a teacher and a student, are coupled
together during training. The teacher network is supposed to be a trustworthy
predictor and the student tries to mimic its predictions. Usually, a student
with a lighter architecture is selected so that we can achieve compression and yet
deliver high-quality results. In such a setting, distillation only happens for
final predictions, whereas the student could also benefit from the teacher's
supervision for internal components.
Motivated by this, we studied the problem of distillation for intermediate
layers. Since there might not be a one-to-one alignment between student and
teacher layers, existing techniques skip some teacher layers and only distill
from a subset of them. This shortcoming directly impacts quality, so we instead
propose a combinatorial technique which relies on attention. Our model fuses
teacher-side information and takes each layer's significance into
consideration, then performs distillation between combined teacher layers and
those of the student. Using our technique, we distilled a 12-layer BERT (Devlin
et al. 2019) into 6-, 4-, and 2-layer counterparts and evaluated them on GLUE
tasks (Wang et al. 2018). Experimental results show that our combinatorial approach is able to outperform other existing techniques.
Comment: AAAI 2021. This work was done while Peyman Passban was at Huawei.
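The fusion step the abstract describes, attending over all teacher layers instead of skipping some, can be illustrated with the short sketch below. It assumes equal hidden sizes, mean-pooled layer summaries for the attention scores, and an MSE matching loss; these choices are illustrative and not necessarily the exact ALP-KD formulation.

import torch
import torch.nn.functional as F


def layer_projection_loss(student_states, teacher_states):
    """student_states / teacher_states: lists of (batch, seq, hidden) tensors,
    one per layer of the student and the teacher respectively."""
    S = torch.stack(student_states, dim=1)   # (batch, n_student, seq, hidden)
    T = torch.stack(teacher_states, dim=1)   # (batch, n_teacher, seq, hidden)
    # Score every (student layer, teacher layer) pair from pooled summaries.
    s_pool, t_pool = S.mean(dim=2), T.mean(dim=2)
    attn = F.softmax(s_pool @ t_pool.transpose(1, 2), dim=-1)   # (batch, n_s, n_t)
    # Fuse teacher layers per student layer, then match them with MSE.
    fused = torch.einsum("bst,btlh->bslh", attn, T)             # (batch, n_s, seq, hidden)
    return F.mse_loss(S, fused)


if __name__ == "__main__":
    student = [torch.randn(2, 16, 768) for _ in range(6)]    # e.g. a 6-layer student
    teacher = [torch.randn(2, 16, 768) for _ in range(12)]   # e.g. a 12-layer teacher
    print(layer_projection_loss(student, teacher).item())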